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Abstract 



Implementation of basic local Monte-Carlo algorithms on ATI Graphics 
Processing Units {GPU) is investigated. The Ising model and pure SU{2) 
gluodynamics simulations are realized with the Compute Abstraction Layer 
(CAL) of ATI Stream environment using the Metropolis and the heat-bath 
algorithms, respectively. We present an analysis of both CAL programming 
model and the efficiency of the corresponding simulation algorithms on GPU. 
In particular, the significant performance speed-up of these algorithms in com- 
parison with serial execution is observed. 

(N 
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O ■ 1 Introduction 

cn 

. In recent years designers of graphics hardware start the new concept for games de- 

I velopers - to accelerate game physics directly on CPUs and, thus, to relieve less 

. efficient for such purposes Central Processor Units (CPUs) of any game physics 

computations. Originally the GPU was the interface unit between the system and 
monitor and has been optimized only for the system memory - video memory data 
KJi \ transferring and for storage of videopages. The next evolutionary step of GPU 

. was the appearance of graphics accelerators. They were designed for simple and 

I frequently used video operations (displacement of large blocks of images from one 

location to another on the screen, filling images, and so on) without much usage 
of CPU. In the early and mid-1990s, CPU-assisted real-time 3D graphics were be- 
coming increasingly common in personal computer and game consoles, which in 
turn lead to ever-growing customer's needs for hardware-accelerated 3D graphics. 
Over the past five years the computer graphics technology has evolved from the 
possibility of performing basic mathematical operations on image pixels to General 
Purpose Computing on GPU technology {GPGPU). 

Nowadays GPUs transformed into powerful programmable parallel processors 
providing, in particular, double-precision fioating point support, which is espe- 
cially essential for developing scientific applications. As an illustrative example, 
the peak performance of ATI RV 770 graphics processors exceeds 1 TFlops for 
single-precision operations and is about 250 GFlops for double-precision operations 
that is far beyond the performance level of up-to-date high-end multicore CPUs. 
Another important issue concerning GPU programming is availability of special pro- 
gramming toolkits aimed at development of non-graphics applications and allowing 
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a programmer to abstract away from sophisticated graphics apphcation interfaces 
(APIs). Corresponding tools, ATI Stream SDK and nVidia CUDA (Compute Uni- 
fied Device Architecture), can be freely downloaded from official sites of leading 
GPU-makers. No wonder that GPGPU attracts a lot of interest from the scientific 
community working in such areas as astrophysics ([i]-[6j), lattice QFT and spin 
systems ([2]-[Il]), numerical analysis of PDE's [12], molecular and fluid dynamics 
etc. (see, for instance [13] as well as [I4]-[17] for further references). Although there 
exist several reports on the topic, most of them are based on utilization of nVidia's 
CUDA toolkit while ATI programming environment seems to be less popular in 
the literature. With the present publication we start our numerical simulations on 
GPU with ATI Stream SDK to flll up this gap. 

One of the main aims of the present study is to implement existing Monte-Carlo 
(MCT) algorithms, widely employed for simulations in lattice QFT and spin models, 
on GPU and to investigate their effectiveness on such parallel systems |18)-[20j. 
Of course, one should always keep in mind that efficiency of a stochastic process 
is also strongly effected by autocorrelation times, which are different for differ- 
ent simulation schemes. That is, even though some particular algorithm is badly 
parallelized in principle, as non-local algorithms Hke the Wolff or Swendsen-Wang 
cluster algorithm, nevertheless it might be quite effective in comparison with some 
local algorithm, such as Metropolis algorithm that suffers from large autocorrelation 
time in the critical region (see however Ref. [21], where another local algorithm, 
the so-called worm algorithm, was investigated, that proved to be deprived of this 
drawback). However, we do not study efficiency of non-local updating schemes on 
GPU here. 

The paper organized as follows. The second section contains the implementation 
details for two local algorithms: the Metropolis and the heat-bath algorithms. In 
subsection 2.1 the common program model of GPU-oriented simulation apphcation 
is described. The features of used pseudo-random number generator is presented 
in subsection 2.2. Next subsection includes the description and performance results 
of 2d Ising model simulations for Metropolis algorithm on GPU while subsection 
2.4 contains some details and performance results of SU{2) gluodynamics simula- 
tions for heat-bath algorithm. In section 3 we discuss the obtained performance 
results and perspectives for prospective investigations. A short introduction into 
ATI Stream Technology is shown in the Appendix A. General information about 
ATI RV770 hardware is collected in the Appendix B. Some basic details of ATI 
Stream SDK is outlined in the Appendix C. Finally, in the Appendix D we provide 
information on the hardware and software configuration used in our Monte-Carlo 
simulations. 

2 Implementation 

We considered 2d Ising model and SU(2) gluodynamics as a perfect tool for testing 
various simulation algorithms. In this section we examine implementation of the 
Metropolis and the heat-bath updating algorithms on GPU for these two particular 
models, respectively. 

ATI Stream SDK allows to realize the algorithms by two software tools: Brook+ 
and CAL (for general information see Appendix C). We started to explore the ATI 
Stream technology with the Brook+, but we found out quickly that its usage does 
not allow to write GPGPU applications in the most efficient way. Brook+ is a 
very transparent and suitable language for stream programming, but it does not 
afford to implement some subtle points, needed for optimal programming on GPU. 
In particular, we employed compute shader model that provides more fiexibility for 
general purpose GPU computations. The basic ingredient of the compute shader is a 
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Figure 1: Structure scheme of simulation program. 



thread (while pixel shaders operate with pixels) , one invocation of a GPU computing 
program (or kernel) corresponding to a single element in the domain of execution. 
All threads initiated by the kernel are composed in thread groups and, within such 
a group, can communicate with each other via special hardware unit. Local Data 
Share, assigned to each SIMD core. For further details we refer to [33] and [34).) 
The critical requirement for performance is another essential argument in favor of 
the CAL use. The present research is done with the CAL. 

2.1 Program model 

Both the Ising model and SU (2) gluodynamics simulations are based on the same 
program model. As it is pointed in Appendix C, ATI Stream SDK does not support 
the execution of a kernel from another kernel. So, each kernel must perform a 
complete operation. All kernels are run by the host program. 

Another feature of the employed program model is that all necessary data for 
simulations are stored in GPU memory. GPU carries out intermediate actions and 
returns the results to host program for final data handling and output. We avoid 
any data transfer during run-time between the host program and kernels to speed-up 
the execution process. 

The structure scheme of simulation program is presented in Fig. 1. 

First of all we have to allocate GPU memory. The most convenient structure 
for this is the so-called global buffer, memory space containing the arbitrary address 
locations to which uncashed kernel outputs are written. ATI Stream SDK provides 
only one global buffer per kernel, therefore, all kernels must utilize the same global 
buffer to store all data. In addition, the memory allocation may be accompanied 
by setup of initial values for some internal variables in the global buffer. This is 
realized by preparing necessary data arrays on host system memory with subsequent 
mapping them into global buffer. 

The next execution stage is lattice initialization. Two standard initial states may 
be produced by the lattice initiaHzation kernels: ordered (cold) or disordered (hot) 
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configuration. Lattice initialization may be obviously done by host program before 
the memory setup kernel, however we carried out full lattice simulation process 
entirely on GPU. 

Thermalization and statistics sweeps are almost the same except for the mea- 
suring kernels, which are not involved in the thermalization process. The locality of 
the examined here algorithms implies, in particular, that the lattice can be divided 
into odd and even sites. To update odd sites, only information from the even site is 
necessary and vice versa. Hence, the complete MC sweep for all lattice requires the 
two groups of kernels execution (to update the odd and even sites, respectively). 
In working passes after each sweep the measuring kernels can be run in order to 
measure microstate parameters (microstate energy, correlations, etc). 

Last stage is the execution of postproduction kernels (such as statistical data 
handling). Afterwards, the export of all data to the host memory must be done. 

2.2 Pseudo-random number generator 

In our simulations we employed the 24-bit Marseglia pseudo-random number gen- 
erator RANMAB0 |22| . Here we used the implementation by James |23) . The 
algorithm is a combination of a lagged Fibonacci generator and a simple arithmetic 
sequence for the prime modulus 2^^ _ 3 = 16777213. The total period of RANMAR 
is about 2"4 « 2.23 x lO^^. 

The generator must be initialized by two given 5-digit integers {initial seeds), 
each set of which gives rise to an independent sequence of sufficient length for an 
entire calculation. The seed variables can have values between and 31328 for the 
first variable and and 30081 for the second variable, respectively. RANMAR can 
create, therefore, 900 million independent subsequences for different initial seeds 
with each subsequence having a length of approximately lO'^" random numbers. 
The 97 seed values and two offsets must be stored in GPU global memory for every 
random number generator used in simulations. So, it is reasonably to use a small 
number of random number generators, which is smaller than number of lattice sites 
for big lattices. As it was mentioned above, every GPU memory cell has four 32-bit 
components, so it is convenient also to choose the total number of generators to be 
multiplied by 4. 

The generator consists of two parts: 1) kernel, which produces the seed numbers 
on initial seed values; in fact this kernel is a replica of RMARIN subroutine of James' 
version [23], and 2) subroutine, which directly produces the random numbers. First 
part has been executed only for initialization, and the second one has been running 
as internal subroutine in all kernels that use random numbers (e.g. initial lattice 
equilibration, MC updates). It is not necessary to store generated random numbers 
in GPU memory in this case; such approach essentially accelerates the run process. 

In both Ising model and SU{2) gluodynamics simulations we used 4 x 8192 = 
32768 generators, and their number may be chosen in accordance with the lattice 
size. The lowest 10 bits of current thread number serve as identifier (ID) of random 
number generator. It is possible to change easily the actual number of generators by 
selecting any number of lowest bits in the thread ID. This realization allows to start 
all generators as independent parallel threads. In our implementation, RANMAR 
generators produce integer random numbers, which are mapped into the interval 
(of fioating point values) [0; 1) afterwards. 

rigorous analysis for a number of parallel pseudo-random generators can be found in |24| . 
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2.3 Metropolis algorithm for GPU 2d Ising model 
simulations 



2d Ising model, being the simplest and well-studied spin system, is a handy toy- 
model for GPU programming experiments. We remind the reader that subject to 
the character of the updating procedure, all conventional dynamic Monte-Carlo 
algorithms fall into two categories: namely, local and non-local algorithms. From 
the parallel computing perspective, former category of MC algorithms are usuahy 
the most convenient to deal with. Indeed, typically the updating scheme in this 
case requires local information only (e.g. information on close neighbors for a given 
spin in the lattice etc.); as a consequence, all stream processors work on their 
own distinct local data domain of the data with little queuing or synchronization 
overheads. On the contrary, the non-local algorithms are significantly harder to 
parallelize since they demand much data communications for stream processor^, 
the feature is especially unwelcome for GPU programming. As it was already noted, 
in the present study we are concerned about local MC update procedures, and the 
simplest of them is the Metropolis algorithm. Implementation of the Metropolis 
updates on parallel systems presents no difficulties since one can always make use 
of standard data decomposition of the lattice onto processors to achieve good load 
balance. Here we use two-dimensional Ising model to illustrate basic concepts of 
GPU parallel programming. 

The partition function for the Ising model is defined as 



with (Ji = ±1 being a spin variable at i site of the spin lattice. Here < i,j > 
denote the summation over nearest neighbor spins. It is quite natural to represent 
Ising spins as 1-bit variables and to store them in registers of the global buffer; 
in the case of GPU programming, this gives additional advantage since it allows 
to reduce memory fetches as well. The most essential part of the program is the 
spin lattice partition. There are several ways to do this. The simplest one is the 
checkerboard (or "red-black") partition, where each stream processor has just one 
spin. All spin lattice is divided into "red" and "black" squares, and all "red" (resp., 
"black") spins are then processed in parallel. More efficient way is to divide the 
system into small squares of some size and to assign each stream processor to one 
such square. Besides, one has also to take into account that the global buffer 
register is four-component. Here we proposed the scheme that very much resembles 
the well-known "direct-dual" decomposition of a spin lattice; this is shown in Fig. 
2. Black sites of the lattice interact with "dual" red sites, and are composed into 4 
component registers (in the figure it is indicated by gray painted rectangles; each 
rectangle represents memory organization in the global buffer) . Similarly, "red" sites 
are collected in the corresponding red painted rectangles and thus stored in global 
buffer registers. We found that such lattice organization allows to reduce memory 
usage and better suites for GPU simulations. 

Simulation process on GPU was performed in the following stages: 

1. random number seed array preparation; 

2. loading of a chosen initial configuration for the spin lattice in the GPU global 



3. equilibration run using the conventional Metropolis updating; 

^For instance, in cluster updating schemes the main computational task is to take local informa- 
tion (e.g., bond connections) and work out global information (i.e., cluster of sites). Investigation 
of the parallel cluster algorithms can be found in [25| . |26| . 
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Figure 2: "Hybrid" checkerboard decomposition of the Ising lattice. 



4. statistics run; 

5. finalization. 

These stages are actually common for both models (see below). In the case of 
Ising model simulation, the source code contains eight kernels. The first kernel is 
designed for generating seed number arrays, which are stored in the GPU global 
buffer and accessible for all remaining kernels. We also dedicate a single kernel for 
the spin lattice initialization (as usual, it defines hot or cold initial configuration). 
For equihbration run we use separate kernels to perform "even" and "odd" sweeps, 
while statistic runs are executed with the help of four kernels (two of them are used 
for sweeps and measurements, and the other two perform reduction operations and 
store obtained results for chosen lattice observables in the global memory). In the 
present study we consider lattice of medium size 1024 x 512 = 524288. In particular, 
we found that it takes about 630 sec for 10^ lattice updates during equihbration 
phase and 730 sec. for 10^ measurements during statistic run phase. 



2.4 SU{2) gluo dynamics simulations 

In our SU(2) gluodynamics simulations we used the hypercubic lattice Lt x 
(Lt < Ls) with hypertorus geometry; Lt and are the temporal and the spatial 
sizes of the lattice, respectively. It should be noted that implemented algorithm 
makes possible to work not only with spatial symmetric lattices {Li ^ L2 = L3 = 
Ls), but also with totahy asymmetric lattices {Li ^ L2 ^ L3 Lt). 

The total number of lattice sites is N sites = Li x L2 ^ L3 x Lt. As it was 
mentioned, we divided whole lattice into 2 parts - odd and even sites. The number 
of odd sites is N^^ect = N sites 1'^, as well as the number of even ones. 

The lattice may be initiahzed by cold or hot configuration. There are 2 possi- 
bilities for hot initialization to fih lattice with the pseudo-random SU{2) matrices: 
constant series (by predefined seeds) or random series (by system timer-dependent 
random seeds) of pseudo-random numbers. In first case every startup process would 
reproduce the same results for every execution (it is convenient for debugging pur- 
poses). 

Standard Wilson one-plaquette action of the SU{2) lattice gauge theory was 
used, 



1 - i Tr V^^ix) 



(2) 
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U^,(x) = Vf,{x)lJ,{x + a/i)Ut (x + ai))Vl{x), (3) 

where /? = is the lattice coupling constant, g is the bare coupling, U^(x) is the 
link variable located on the link leaving the lattice site x in the ^ direction, U^y(a;) 
is the ordered product of the link variables (here and below we omitted the group 
indices). 

The link variables Up(a;) are the SU{2) matrices and can be decomposed in 
terms of the unity, /, and PauH aj, matrices in the color space, 

V^ix)=IU'^{x)+ta,W^ix). (4) 

Each global buffer cell comprises four 32-bit numbers (single precision). Thus, 
it is natural to store all components of link matrices {U^{x), U^{x), Uf^(x) and 
U^{x)) as one cell. For example, if general purpose register is Rl = Up(a;), then 
the components of this register are Rl.x = U^{x), Rl.y — U^{x), Rl.z = W^ix) 
and Rl.w = U^{x). 

To update the lattice, heat-bath algorithm with overrelaxation was used [27], 
|28j . At every MC iteration we successively replaced each lattice link matrix with 

V^{x)^i]^{x)^wW^{x), (5) 

where 

W^{x)= J2 [ U,(a;)U^(a; + ai>)Ut(a; + a/i) (6) 

+\J1{x — aj))Up(a; — ai')\Ji,{x + a{jl — v)) , 

w is a normalization constant such that 

U^(x) G SU{2). (7) 

In order to thermahze the system we passed 1000 MC iterations for every run. 
For measuring we used 1000 MC configurations (separated by 5 updates). MC 
procedure contains four kernels (see Fig. 1): 

1. kernel for measuring the value W^(x) in eqn.Q for odd sites; 

2. kernel for updating the odd site's links; 

3. kernel for measuring the value W^(x) in eqn.® for even sites; 

4. kernel for updating the even site's links. 

First two kernels are executed sequently for each ^ direction. After their execution 
last two kernels are started for each ^ direction. Updating kernels have 8 attempts 
per start to modify the link matrices. 

The run-time parameters, such as numbers of thermalization and working MC 
sweeps, number of link updates per one MC sweep, etc., can be easily set up before 
execution. 

In our program model each thread operates only one lattice site. So, the stream 
for every kernel has Nyect elements. It is necessary to store iV„ect values of W^(a::) 
in global buffer during the MC procedure. Odd and even sites used the same region 
in global buffer for any ^. 

Working MC passes contain the measuring kernels, which calculate microstate 
energy. All averaging measurements were performed with the double precision. 
Every thread in measuring kernel produces two 64-bit values of local averages ([3]) 
- for spatial and temporal plaquettes, correspondingly. These values are stored in 
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global buffer and summed up by so-called reduction kernel, which decreases the 
dimensionality of a stream by folding along one axis. Reduction kernel does not 
use the GPU power in the best way, because at last execution stages less threads 
are run than it could be. Nevertheless, such kernels must be used to avoid the high 
memory consumption. The final double-precision values of temporal and spatial 
averages are stored in global buffer and transferred after all MC sweeps to the host 
for postproduction analysis. 





4x8'' 


4 X 16-' 


4 X 24^ 


4 X 32^ 


CPU, sec. 


30 


274 


1985 


3796 


GPU, sec. 


5.2 


17.9 


52.8 


120.6 


GPU utilization 


50% 


92% 


97% 


98% 


Speed-up, times 


5.8 


15.3 


37.6 


31.5 





8 X 16^ 


8 X 24^ 


8 X 32^ 


CPU, sec. 


744 


2060 


7920 


GPU, sec. 


32.7 


102.3 


397.4 


GPU utilization 


95% 


98% 


99% 


Speed-up, times 


22.8 


20.1 


19.9 



Table 1: Performance comparison of serial (CPU) and parallel (GPU) MC simulations for 
pure SU{2) gluodynamics on some lattices. 

The performance comparison of serial and parallel MC simulations for different 
lattices is collected in Table 1. In the first row the lattice geometry is shown. The 
next row contains the execution time of Fortran-program, compiled with the trial 
version of Intel Fortran compiler 11 (last available version) [37]. The software and 
hardware configurations of simulation machines are presented in Appendix D. The 
execution time of GPU-oriented program is shown in the third row. The fourth 
row comprises percentage of GPU utilization during the program execution. For 
smallest lattice (4x8'^) this value is low because of low density of kernel operations in 
the whole execution process. The total speed-up of parallel (GPU) MC simulations 
in comparison with serial (CPU) one is shown in the last row. It can be seen that 
for small lattices (4 x 8"^, 4 x 16"^) the gain in productivity is not too high (5.8 and 
15.3 times, respectively). But for bigger lattices it varies from 19.9 to 37.6 times. 
We believe that in full QCD simulations the total speed-up should be much higher, 
because of big number of arithmetic operations per every MC sweep. 

3 Discussion 

The main purpose of the present work is, on the one hand, to demonstrate how 
MC simulations can be realized on GPU, and, on the other hand, to estimate 
performance gain of implemented algorithms with ATI Stream SDK. Here we inves- 
tigated two basic local updating schemes, the Metropolis and heat-bath algorithms, 
by example of simplest physical systems {2d Ising and SU(2) gluodynamics, respec- 
tively). We found significant performance increase of 10 — 50 x in GPU numerical 
simulations in comparison with their execution on contemporary CPUs. The corre- 
sponding source codes can be sent via e-mails by request (see front page for e-mail 
addresses). 

Let us point out the key features of the implementation presented here. First 
of all, it should be noted that we utilized a beta version of ATI Stream SDK, in 
particular, some useful features (such as LDS) were not realized to the full extent. 
For instance, to avoid unstable behavior of the program, we were forced to perform 
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reduction via global buffer directly, though, it was more natural to utiHze special 
designed hardware unit (LDS) for this purpose. Since this shortcoming might affect 
the applications performance, it is very desirable, therefore, that it will be fixed in 
future releases of ATI Stream sdkI. The other obvious point is that our applications 
can be run on ATI hardware only. However, recently approved standard for Open 
Computing Language (OpenCL) will give an opportunity to develop crossplatform 
applications, even for heterogeneous computational systems (CPUs+GPUs). The 
support for this new technology is already expected this year from almost all lead- 
ing hardware manufacturers [39]. As a particular example, OpenCL was already 
included in MAC OS X Snow Leopard |40j . Finally, it should be emphasized that 
during the GPU simulations the CPU is actually free, so the data postprocessing 
may be combined with the next simulations. Moreover, the programming model al- 
lows simultaneously use up to 8 GPU units per host, so up to 8 points (for example, 
different initial lattice parameters) may be obtained at one run. 

One of the main directions for our future research is full QCD simulations on 
GPU. As known, realization of fermionic sector on the lattice is the most resource- 
intensive task, requiring laborious matrix-algebra computations (see for recent re- 
view Ref . [9] and references therein) . A task of such kind is especially welcome for 
GPU implementation and some discussion of this topic can be already found in the 
literature [7], [8]. Indeed, one may expect an essential growth of the performance in 
this case due to high intensity of matrix arithmetic operations in such simulations 
with relatively low memory import-export operations that are the bottleneck for 
GPU kernels. 



A ATI Stream technology overview 

ATI Stream technology is a set of advanced hardware and software technologies that 
enable ATI graphics processors, working in concert with the CPU, to accelerate 
applications beyond just graphics [29], [30], [31] . 



Ultra-Threaded Dispatch Processor 




I Data Request Bus | 

Figure 3: Generalized structure (left) and simplified block diagram (right) of the 
stream processor. 



A stream is a collection of data elements of the same type that can be oper- 
ated in parallel. A series of operations, which are applied to each element in the 
stream, are called stream kernels (or simply kernels). It is possible to run kernel on 
some rectangular region of the stream (the so-called domain of execution) . Stream 

^During preparation of this report ATI team released new version of their programming toolkit, 
ATI Stream 1.4 beta. We did not test this new version yet. 
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Figure 4: Simplified block diagram of the VLIW thread processor. 



processor (or GPU) is the hardware, which can operate the streams and execute 
kernels. 

A modern graphics hardware can contain one or more stream processors. Differ- 
ent stream processors have different characteristics, but follow a similar design pat- 
tern. In the Fig. 3 the generalized structure and simplified block diagram of stream 
processor are presented. Each stream processor comprises of numerous thread pro- 
cessors that are grouped into single-instruction multiply-data (SIMD) engines (see 
[30] and Fig. 3). Every SIMD engine contains the local data share unit {LDS) 
aimed to fast transfer data between the thread processors in a SIMD core. Thread 
processors contain stream cores, which are the fundamental, programmable com- 
putational units, responsible for performing integer, single precision fioating point, 
double precision fioating point and transcendental operations and the data types 
conversation. All thread processors within a SIMD engine execute the same in- 
struction sequence; different SIMD engines can execute different instructions. Each 
instance of a kernel running on SIMD engine's thread processor is called a thread. 
The group of threads that are executed together is called a wavefront. The number 
of such threads in wavefront is called wavefront size. The total execution time for 
the wavefront is determined by the thread with the longest execution time. 

A thread processor is arranged as a five- way VLIW processor (Fig. 4). Up to 
five scalar operations can be co-issued in a very long instruction word ( VLIW) . To 
hide latencies due to memory access and stream core operations, multiple threads 
are interleaved. In a thread processor, up to four threads can issue four VLIW 
instructions over four cycles. Stream cores can execute single-precision fioating 
point or integer operations. One of the five stream cores ( T-stream core) also can 
handle transcendental operations. Double-precision fioating point operations are 
processed by connecting four of the stream cores (excluding the transcendental 
core) to perform a single double-precision operation. The thread processor also 
contains one branch execution unit to handle branch instructions. 



B General information about ATI RV770 videochip 

The most recent ATI Radeon architecture RV770 was released on June 25, 2008. 
There are several CPUs based on RV770 (HD 4870, HD 4850, etc.). The RV770 fea- 
tures a 256-bit memory controller and is the first GPU to support GDDR5 memory 
(HD 4870 and HD 4870X2 cards are equipped with the GDDR5 memory), which 
possesses a quadruple effective data transfer rate relative to its physical clock rate 
(900 MHz*4=3600 MHz for HD 4870 and HD 4870X2), instead of double as with 
GDDR3 memory. The bandwidth of GDDR5 is up to 115 GB/s. All mentioned 
GPU cores are produced with the 55 nm fabrication process. 

ATI's RV770 cards may be equipped with the 512, 1024 or 2048MB of memory. 
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All mentioned engines are based on the system bus PCI Express 2.0 xl6, but also 
support PCI Express 1.1 and 1.0a. 

The general information about some ATI's RV770 video cards is presented in 
Table 2. The last two columns are the theoretical peak performance in single- 
precision floating point operations (in Tflops) and thermal design power (TDP), 
maximal value describing the thermal limits of cards (in Watts) , respectively. 



Model 


Core 


Memory 


Bandwidth 


Bus width 


Tflops 


TDP 




(MHz) 


(MHz) 


(GB/s) 


(bit) 


(peak) 


(W) 


HD 4850 


625 


993 


64 


256 


1.0 


110 


HD 4870 


750 


900* 


115.2 


256 


1.2 


150 


HD 4850X2 


2x625 


993 


2x64 


2x256 


2.0 


230 


HD 4870X2 


2x750 


900* 


2x115.2 


2x256 


2.4 


286 



Table 2: General information about some ATI's video cards [32]. 

Graphics cards HD 4870X2 and HD 4850X2 have two stream processors each, 
the remaining RV770 cards have one stream processor only. Every stream processor 
of RV770 comprises 800 thread processors that are grouped into 10 SIMD cores (see 
[30) and Fig. 3). The LDS size for every RV770 SIMD is 16KB. Thread processors 
of RV770 contain four stream cores and one T-stream core, which are also named 
ALU.[x,y,z,w] and ALU.t, respectively (see Fig. 4). Each pair of stream cores 
(i.e., ALU.x, ALU.y and ALU.z, ALU.w) may be grouped into one 64-bit ALU 
for double-precision operations. The RV770 thread processor may co-issue up to 6 
operations (flve arithmetical/logical and one flow control instruction). It also can 
perform 1.25 machine scalar operation per clock for each of 64 data elements. 

C ATI Stream SDK overview 

There are two main ATI Stream environments, which are composed into one com- 
plete software development kit: 

• ATI Compute Abstraction Layer (CAL), 

• Brook+. 

ATI CAL is a device library that provides a forward-compatible interface to ATI 
stream processors. ATI CAL lets software developers interact with the stream pro- 
cessors cores at the lowest-level for optimized performance, while maintaining for- 
ward compatibility. ATI CAL is ideal for performance-sensitive developers because 
it minimizes software overhead and provides full-control over hardware-specific fea- 
tures that might not be available with higher-level tools. 

Brook is the high-level stream programming language for using modern graphics 
hardware for non-graphical computations. Brook+ is AMD's modified Brook open 
source data-parallel C++ compiler [30]. Being built on top of ATI CAL, Brook+ 
source-to-source meta-compiler translates programs into device-dependent kernels 
with the virtual instruction set architecture (ISA) [33], the so-called ATI Interme- 
diate Language (IL). The generated C++ source includes the CPU code and the 
stream processor device code, both of which are later linked into the executable. 
Brook+ is a higher-level language that is easier to use, but does not provide all the 
functionality that ATI CAL does. 

ATI IL is a pseudo-assembly language that can be used to describe kernels for 
stream processors [34]. It is designed for efficient generalization of stream processor 
instructions, so that programs can run on a variety of platforms without having to 
be rewritten for each platform. 
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ATI CAL API comprises one or more stream processors connected to one or 
more CPUs by a high-speed bus. The CPU runs the stream processor device driver 
program and controls the stream processors by sending commands using the ATI 
CAL API. Stream processor can read from and write to its own local stream pro- 
cessor memory and system memory using PCI Express bus. ATI CAL API exposes 
the stream processors as SIMD array of computational processors. 

The CAL runtime comprises system initialization and query, device, context and 
memory management, program loading and execution. 

In CAL all physical memory blocks allocated by the application for the use 
in stream kernels are referred to as resources. These blocks can be allocated as 
one-dimensional or as two-dimensional arrays of data. There are several memory 
organizing structures, which may be used with the CAL kernels: input buffers, 
output buffers, constant buffers, scratch buffers, global buffers. The input buffer 
can only be read from, not written to. The output buffer can only be written to, not 
read from. Constant buffer is the off-chip memory that contains constants. Scratch 
buffer is a variable-sized space in off-chip memory that stores some of the general 
purpose registers {GPRs). The global buffer lets applications read from, and write 
to, arbitrary locations of memory. Global buffers use a linear memory layout. All 
these buffers support the following data types: signed or unsigned 32-bit integer, 
32-bit floating point (single precision), 64-bit floating point (double precision). A 
128-bit address mapped memory space consisting of four 32-bit components is called 
a register. It is accepted to mark its components as .x, .y, .z and .w, respectively. For 
double-precision operations the pair of 32-bit components are grouped into 64-bit 
values (.xy and .zw). 

ATI IL is a typeless language - the value in a register has no intrinsic type, it is 
simply 32 or 64 bits of data [34]. Each mathematical instruction performs a typed 
computation - signed or unsigned integer, floating point, double-precision floating 
point - on one or more untyped operands. 

Unfortunately, there are some points, which make the programming on GPU low- 
level SDK complicate. For instance, ATI CAL does not support recursive execution 
of kernels as well as the execution of a kernel from the other. Also, there are no 
runtime debugging tools in ATI CAL. And flnally, ATI Stream SDK still has beta 
status, which may implies an occurrence of some compilation errors. 

Another strong point of ATI Stream SDK is the fact that ATI drivers (ATI 
Catalyst) from version 8.12 contain all necessary libraries for stream computing, so 
the compiled program could be directly executed on any computer system equipped 
with the ATI Radeon cards without the installation of additional software. 

D Used hardware/software configuration 

Hardware conflguration: 

• Ising model: Intel Core 2 Duo CPU E6550 @ 2.33GHz, 4GB RAM memory; 

• SU(2): Intel Core 2 Quad CPU Q6600 @ 2.40GHz, 4GB RAM memory. 
Graphics card speciflcation: 

• Graphics card: AMD/ ATI Radeon HD 4850 

• GPU model: RV770Pro 

• Number of thread processors: 160 (combined in 10 SIMD cores) 

• Number of local memory fetch units: 40 (4 for each SIMD core) 

• Engine clock: 625 MHz 

• Memory clock: 993MHz (GDDR3 type, effective memory clock - 1984MHz) 

• Memory bus: 256 bit 
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• Memory size: 512MB 
Software configuration: 

• OS: Windows XP SP3 (32-bit) 

• Used driver: ATI Catalyst 9.1, [35] 

• MS Visual Studio 2008 Express edition (C++ compiler), [36] 

• ATI Stream SDK v. 1.3 beta, [38] 
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